This report presents a text mining analysis of the best-selling non-fiction book Invisible Women. The exploration aims to better understand the textual characteristics of feminist texts. Our text mining analysis shows that [ADD RESULTS HERE].
The goal of the analysis is to study textual data and extract specific characteristics of feminist texts on gender bias in data.
The upcoming analysis is a comprehensive text mining exploration of Invisible Women following a four-part structure:
Our research questions are the following:
Invisible Women is an award-winning best seller published in 26 languages that sold 122,255 copies in less than a month after its release in March 2019, before the COVID-19 lockdowns. The book attracted immediate attention from the public and the media, both struck by its disclosure of the inherent data bias in a world designed for men [REFER THIS AS A QUOTE USING HW1 OF PROGRAMMING TOOLS]. The book exposes the large number of situations in which decision-makers use a generic male default to implement public policies without recognizing the not-so-obvious fact that what works best for men does not necessarily work best for women.
The release of the best seller was followed by a string of awards. From the Royal Society Insight Investment Science Book Prize and the FT & McKinsey Business Book of the Year to the Reader’s Choice Books Are My Bag Award and the Times Current Affairs Book of the Year, Invisible Women won wide acclaim among its audience.
We chose to run a text mining analysis of this book because one of the team members had recently read it for a book club and suggested diving deeper into the hot topic of gender bias in data. Considering our chosen specialization (Business Analytics) and the time we have invested in learning about data and improving our data science skills, we realized that we were missing one perspective: data analysis through a gender lens. The analysis of this book therefore serves two purposes: developing new data science skills (i.e. text mining) at the edge of Artificial Intelligence (AI), and studying (text) data from a gender perspective.
The book under study is downloaded in its PDF version from [source: https://yes-pdf.com/book/113#google_vignette]. To load it into RStudio, we use the pdf_text utility from the pdftools package, which extracts text from PDF files. Its main advantage is that it is easy to use. On the downside, cleaning the extracted text can be long and tedious.
Here are the book’s characteristics:
| Title | Author | Date | Parts | Chapters | Pages |
|---|---|---|---|---|---|
| Invisible Women | Caroline Criado Perez | 2019 | 6 | 16 | 399 |
The preface of the book starts on page 11 and the introduction on page 15.
After indicating where exactly the text under study starts and ends, and after extracting the chapter titles and organizing the text by chapters, we obtain the usable data to further analyze Invisible Women’s content.
The output below shows the beginning of the first five chapters of the book.
| document | text | part |
|---|---|---|
| CHAPTER 1 | Can Snow-Clearing be Sexist? It all starte… | 1 |
| CHAPTER 2 | Gender Neutral With Urinals In April 2017 … | 1 |
| CHAPTER 3 | The Long Friday By the end of … | 2 |
| CHAPTER 4 | The Myth of Meritocracy For most of th… | 2 |
| CHAPTER 5 | The Henry Higgins Effect When Facebook… | 2 |
Tokenization splits a text into tokens. Our unit of analysis is the word, so we tokenize each chapter (i.e. document) on whitespace. In doing so, we remove numbers, punctuation, symbols and separators, because we believe that removing them does not affect our analysis and that keeping these elements would not bring more insight.
The quanteda package works with a corpus object.
The summary below shows that Invisible Women consists of 16 documents (i.e. chapters). For each document, three columns give the number of token types, the number of tokens and the number of sentences.
#> Corpus consisting of 16 documents, showing 16 documents:
#>
#> Text Types Tokens Sentences
#> text1 1703 6340 185
#> text2 1884 7515 196
#> text3 2025 8428 206
#> text4 1866 7383 224
#> text5 1650 5829 166
#> text6 1584 5609 196
#> text7 1320 4385 97
#> text8 1306 4170 121
#> text9 2159 8775 322
#> text10 1982 7882 168
#> text11 1806 6771 180
#> text12 1445 5267 132
#> text13 1198 3871 116
#> text14 2000 7935 240
#> text15 851 2375 78
#> text16 1481 4847 98
To continue the cleaning process, we remove words that bring little to no information using the stop_words dictionary from the quanteda package, and we convert all letters to lowercase since proper nouns (such as first or last names) are not of particular importance in this book.
Removing stop words reduces the number of features (terms) to analyze, so that the analysis focuses on terms that carry relevant information. To this end, we also remove the word “chapter”, which does not provide any value.
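The report's own pipeline runs in R. As a language-neutral illustration of the cleaning steps described so far (tokenizing, dropping numbers and punctuation, lowercasing, removing stop words), here is a minimal Python sketch. The stop-word list and the function name `clean_tokens` are assumptions made for this example, not the dictionary or code used in the report.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the report's dictionary).
STOP_WORDS = {"the", "a", "an", "be", "can", "it", "all", "with", "in", "of", "chapter"}

def clean_tokens(text):
    """Lowercase the text, keep alphabetic tokens only, then drop stop words."""
    # [a-z]+ discards digits, punctuation, symbols and separators in one pass.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(clean_tokens("CHAPTER 2 Gender Neutral With Urinals In April 2017"))
# ['gender', 'neutral', 'urinals', 'april']
```

Note that this simple pattern also splits hyphenated words such as "Snow-Clearing" into two tokens, which matches what the token listings in this report display.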
Lemmatization simplifies tokens by replacing each one with its dictionary form (its lemma), which reduces the vocabulary to its simplest meaningful essence. Consequently, the number of token types in the corpus is reduced. For example, “started” and “starts” are both reduced to the lemma “start”.
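Dictionary-based lemmatization can be pictured as a simple lookup. The following Python sketch uses a hypothetical four-entry dictionary (`LEMMAS`) purely for illustration; a real lemma dictionary contains many thousands of entries, and the report performs this step in R.

```python
# Hypothetical mini lemma dictionary, for illustration only.
LEMMAS = {"started": "start", "starts": "start", "urinals": "urinal", "data": "datum"}

def lemmatize(tokens):
    """Replace each token by its dictionary lemma; unknown tokens are kept as-is."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = ["started", "starts", "urinals", "gender"]
lemmas = lemmatize(tokens)
print(lemmas)                                     # ['start', 'start', 'urinal', 'gender']
print(len(set(tokens)), "->", len(set(lemmas)))   # 4 -> 3 distinct types
```

The number of tokens is unchanged (four), while the number of distinct types drops from four to three.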
The output below displays, for each of the first six chapters, the lemmas of the first twelve tokens as well as the number of remaining tokens. For example, chapter (i.e. text) one contains 2,413 tokens after cleaning and lemmatization (the 12 displayed plus 2,401 more).
#> Tokens consisting of 6 documents.
#> text1 :
#> [1] "snow" "clear" "sexist" "start" "joke"
#> [6] "official" "town" "karlskoga" "sweden" "hit"
#> [11] "gender" "equality"
#> [ ... and 2,401 more ]
#>
#> text2 :
#> [1] "gender" "neutral" "urinal" "april" "veteran"
#> [6] "bbc" "journalist" "samira" "ahmed" "toilet"
#> [11] "screen" "negro"
#> [ ... and 2,755 more ]
#>
#> text3 :
#> [1] "friday" "day" "october" "icelandic"
#> [5] "friday" "supermarket" "sell" "sausage"
#> [9] "favourite" "ready" "meal" "time"
#> [ ... and 3,091 more ]
#>
#> text4 :
#> [1] "myth" "meritocracy" "20" "century"
#> [5] "female" "musician" "york" "philharmonic"
#> [9] "orchestra" "couple" "blip" "1950s"
#> [ ... and 2,676 more ]
#>
#> text5 :
#> [1] "henry" "higgins" "effect" "facebook" "coo"
#> [6] "sheryl" "sandberg" "pregnant" "time" "google"
#> [11] "pregnancy" "easy"
#> [ ... and 2,172 more ]
#>
#> text6 :
#> [1] "worth" "shoe" "bisphenol" "bpa" "scare"
#> [6] "1950s" "synthetic" "chemical" "production" "durable"
#> [11] "plastic" "find"
#> [ ... and 2,002 more ]
Stemming also simplifies tokens, reducing each word to its stem with a simple rule-based algorithm through the tokens_wordstem() function. Like lemmatization, stemming reduces the size of the vocabulary, but in a less consistent way: because the rules strip suffixes without consulting a dictionary, the resulting stem is not always a real word.
Since the interpretation of the tokens matters and stemming does not guarantee meaningful tokens (e.g. “official” is reduced to “offici”), we do not use stemming in the rest of our analysis and apply it here only to demonstrate its purpose.
The output below displays the first twelve tokens of each of the first six chapters, reduced to their stems. For example, “equality” is reduced to “equal” and “urinal” to “urin”.
#> Tokens consisting of 6 documents.
#> text1 :
#> [1] "snow" "clear" "sexist" "start" "joke"
#> [6] "offici" "town" "karlskoga" "sweden" "hit"
#> [11] "gender" "equal"
#> [ ... and 2,401 more ]
#>
#> text2 :
#> [1] "gender" "neutral" "urin" "april" "veteran"
#> [6] "bbc" "journalist" "samira" "ahm" "toilet"
#> [11] "screen" "negro"
#> [ ... and 2,755 more ]
#>
#> text3 :
#> [1] "friday" "day" "octob" "iceland"
#> [5] "friday" "supermarket" "sell" "sausag"
#> [9] "favourit" "readi" "meal" "time"
#> [ ... and 3,091 more ]
#>
#> text4 :
#> [1] "myth" "meritocraci" "20" "centuri"
#> [5] "femal" "musician" "york" "philharmon"
#> [9] "orchestra" "coupl" "blip" "1950s"
#> [ ... and 2,676 more ]
#>
#> text5 :
#> [1] "henri" "higgin" "effect" "facebook" "coo"
#> [6] "sheryl" "sandberg" "pregnant" "time" "googl"
#> [11] "pregnanc" "easi"
#> [ ... and 2,172 more ]
#>
#> text6 :
#> [1] "worth" "shoe" "bisphenol" "bpa" "scare"
#> [6] "1950s" "synthet" "chemic" "product" "durabl"
#> [11] "plastic" "find"
#> [ ... and 2,002 more ]
Now, leaving stemming aside, we compute the Document-Term Matrix (DTM) that will be useful throughout the analysis.
The snapshot of the matrix below indicates that, after cleaning and lemmatizing, there are 5,742 features to analyze and that the DTM is 83.76% sparse (i.e. it contains mostly zeros). The matrix displays the frequency of features (i.e. terms or words here) by document (i.e. text or chapter here). For example, the first row indicates that in chapter one the word sexist is found twice and the word town six times, and the first column indicates that, among the six chapters displayed, the word sexist is found in chapters one, four and six.
#> Document-feature matrix of: 16 documents, 5,742 features (83.76% sparse) and 0 docvars.
#> features
#> docs sexist joke town karlskoga sweden hit initiative lens harsh
#> text1 2 1 6 5 3 1 1 1 1
#> text2 0 0 0 0 4 1 0 0 0
#> text3 0 0 0 0 7 3 0 0 0
#> text4 4 0 0 0 0 0 1 0 1
#> text5 0 0 0 0 0 0 0 0 0
#> text6 1 0 0 0 0 1 0 0 0
#> features
#> docs glare
#> text1 1
#> text2 0
#> text3 0
#> text4 0
#> text5 0
#> text6 0
#> [ reached max_ndoc ... 10 more documents, reached max_nfeat ... 5,732 more features ]
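To make the notions of document-term matrix and sparsity concrete, here is a small Python sketch on three toy documents. The documents and the helper names `build_dtm` and `sparsity` are hypothetical; the report builds its matrix in R with quanteda.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    vocab = sorted({t for d in docs for t in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

def sparsity(rows):
    """Share of cells in the matrix that are equal to zero."""
    cells = [c for row in rows for c in row]
    return sum(1 for c in cells if c == 0) / len(cells)

# Toy documents (already cleaned and lemmatized), for illustration only.
docs = [["town", "town", "sexist", "joke"],
        ["sweden", "hit"],
        ["sexist", "sweden", "sweden"]]
vocab, rows = build_dtm(docs)
print(vocab)                     # ['hit', 'joke', 'sexist', 'sweden', 'town']
for row in rows:
    print(row)
print(round(sparsity(rows), 4))  # 0.5333 (8 zero cells out of 15)
```

With only five features the toy matrix is about 53% sparse; with thousands of features, most of which occur in only a few chapters, sparsity above 80% is unsurprising.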
To proceed to the Exploratory Data Analysis (EDA), we use the quanteda package.
To start the EDA, we visually assess which words appear most often in the corpus. The CoW plot is a visual representation of term frequencies in which the size and position of terms reflect their frequencies. Nevertheless, this visualization is more graphic than informative, since the only information we can extract from it is that the terms with the largest font sizes are the most frequent in the corpus. Indeed, from the plot below, we see that woman is the most used term in the corpus (largest font size and centered position), followed by female, datum, male, find, time, gender and study.
To generate the CoW plot, we use the DTM that was obtained after cleaning and lemmatizing but without considering the stemming, see Document-Term Matrix in Section 2 Data Structuring and Cleaning.
To assess more accurately the frequency of the terms in the corpus, we compute the global frequencies. The following graphical representation displays the ten most frequent terms in the corpus. As inferred previously from the CoW, we see that woman is the most frequent term overall followed by female, datum, male, find, time, gender and study.
Global frequencies indicate that woman is by far the most frequent term and that the frequency differences between the following nine terms are much less extreme (i.e. their ranks are less distinguishable).
In addition, we see that the term data was lemmatized into datum.
To dig deeper into these frequencies, we then display a Term-Frequency (TF) table providing term frequencies, their ranks and document frequencies. The feature column lists the lemmatized tokens; frequency gives the number of times the term is found in the corpus (i.e. global frequency); rank sorts the terms by decreasing frequency (i.e. rank 1 is the most frequent term); and docfreq indicates the number of documents in which the token is found (i.e. document frequency).
The table below shows that woman appears 1,594 times in the corpus and in all 16 documents (docfreq = 16), which means that it is not a document-specific term. Moreover, it appears about four times as often as the second most frequent term, female.
| feature | frequency | rank | docfreq |
|---|---|---|---|
| woman | 1594 | 1 | 16 |
| female | 395 | 2 | 15 |
| datum | 358 | 3 | 16 |
| male | 334 | 4 | 16 |
| find | 298 | 5 | 16 |
| time | 260 | 6 | 16 |
| gender | 256 | 7 | 16 |
| study | 224 | 8 | 15 |
| gap | 176 | 9 | 16 |
| sex | 172 | 10 | 15 |
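The columns of this table can be reproduced from tokenized documents in a few lines. The Python sketch below uses three toy documents and a hypothetical helper `term_frequency_table`; ties in frequency are broken alphabetically here, which is a choice made for this example and may differ from the tool used in the report.

```python
from collections import Counter

def term_frequency_table(docs):
    """Return rows (feature, frequency, rank, docfreq), most frequent first."""
    freq = Counter(t for d in docs for t in d)          # global frequency
    docfreq = Counter(t for d in docs for t in set(d))  # number of documents
    ordered = sorted(freq, key=lambda t: (-freq[t], t))
    return [(t, freq[t], rank, docfreq[t])
            for rank, t in enumerate(ordered, start=1)]

# Toy documents, for illustration only.
docs = [["woman", "woman", "female", "tax"],
        ["woman", "female", "tax", "tax"],
        ["woman", "drug"]]
for row in term_frequency_table(docs):
    print(row)
# ('woman', 4, 1, 3)
# ('tax', 3, 2, 2)
# ('female', 2, 3, 2)
# ('drug', 1, 4, 1)
```

In this toy corpus, woman occurs in every document (docfreq equal to the number of documents), whereas drug is specific to a single one.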
The graph below gives an overall view of the previous table, indicating 1) the term frequency and 2) whether a word is specific to a document or not (i.e. document frequency). As observed previously, woman is the most frequent term in the corpus and is not document-specific, since it appears in all chapters of the book. By contrast, the terms trial and tax are less frequent overall and appear in fewer documents, meaning that they are more specific to some chapters (i.e. documents) of the book.
Since woman has a very large term frequency and is not document-specific, we remove this term for the rest of our analysis: it would always appear as the top frequency in all documents and would mask other important and interesting insights.
The plot below shows the ten most frequent terms for selected chapters of the book [how are the chapters chosen?]. Displaying the top frequencies of each document would have been interesting, but with 16 chapters it would have caused an information overload, so we decided not to display it.
First, we see that chapter 10 (i.e. text10), The Drugs Don’t Work, is associated with sex, drug and study. In the previous plot (TF versus DF), we saw that sex and study are not document-specific but that drug is more so. Therefore, we can assume that drug is more specific to chapter 10 than the two other terms. Nevertheless, we do not want to jump to conclusions yet; the document specificity of terms is explored in more depth later. Second, chapter 13, From Purse to Wallet, seems to be associated only with tax and, like drug, tax is more document-specific. Third, chapter 14, Women’s Rights are Human Rights, is associated only with female. Fourth, chapter 3, The Long Friday, is related to pay, leave and time. Lastly, chapter 4, The Myth of Meritocracy, is equally associated with female and male.
Even though some of those words are informative, others are much less insightful. Indeed, it is not enlightening to see female associated with one document, since the whole book is about feminism.
Zipf’s law describes the distribution of words in a corpus by relating term frequencies to their ranks: the frequency of a token is inversely proportional to its rank. Therefore, the plot below (on a log-log scale) shows a negative linear relation; on the original (non-log) scale, the relation is a steeply decreasing power law rather than a straight line.
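Written out, the inverse proportionality between frequency and rank becomes a straight line once logarithms are taken. The symbols $f$ (frequency), $r$ (rank), $C$ (a corpus-dependent constant) and $s$ (an exponent, equal to 1 in the classical statement of the law) are notation introduced here for illustration:

$$
f(r) \approx \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) \approx \log C - s \,\log r
$$

On a log-log plot, the slope of the line is therefore $-s$ and the intercept is $\log C$.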
This plot shows that female, datum, male, find and time are the most frequent terms of the corpus, which suggests that they are not chapter-specific but frequent in all chapters of the book. Although Zipf’s law does not provide precise information on document specificity, it is very unlikely that these terms are specific to one or a few chapters that happen to be long enough to generate such high counts. Because these terms are very frequent across the whole corpus but are not treated as stop words, they dominate raw frequency counts and could hide more meaningful, chapter-specific terms. This observation leads us to look at weighted frequencies.
The TF-IDF matrix is a weighted document-feature matrix displaying term frequency–inverse document frequency that shows how important a word is to a document. In other words, it is used to re-balance a term frequency with respect to its document-specificity.
The output below is the document-feature matrix of 16 documents and 5,741 features, which shows the weighted frequency of each token by chapter of the book. The sparsity (83.77%) is slightly higher, by 0.01 percentage point, than that of the DTM in which woman is not removed (83.76%). This is expected: woman appears in all 16 chapters, so removing it deletes a column that contains no zeros, and the share of zero cells rises.
#> Document-feature matrix of: 16 documents, 5,741 features (83.77% sparse) and 0 docvars.
#> features
#> docs sexist joke town karlskoga sweden hit initiative lens
#> text1 0.718 0.903 5.42 6.02 1.08 0.204 0.505 0.903
#> text2 0 0 0 0 1.44 0.204 0 0
#> text3 0 0 0 0 2.51 0.612 0 0
#> text4 1.436 0 0 0 0 0 0.505 0
#> text5 0 0 0 0 0 0 0 0
#> text6 0.359 0 0 0 0 0.204 0 0
#> features
#> docs harsh glare
#> text1 0.903 0.903
#> text2 0 0
#> text3 0 0
#> text4 0.903 0
#> text5 0 0
#> text6 0 0
#> [ reached max_ndoc ... 10 more documents, reached max_nfeat ... 5,731 more features ]
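The report does not state which weighting scheme is used, but the values displayed above are consistent with the raw term count multiplied by the base-10 logarithm of the number of documents divided by the document frequency. The Python sketch below checks this on three cells of chapter one. The document frequencies used (1 for karlskoga, 2 for town, 7 for sexist) are assumptions inferred from the displayed values, not figures given in the report.

```python
import math

def tf_idf(count, doc_freq, n_docs):
    """Raw term count times log10(number of documents / document frequency)."""
    return count * math.log10(n_docs / doc_freq)

N_DOCS = 16  # chapters in the corpus

# (term, count in chapter one, ASSUMED document frequency)
examples = [("karlskoga", 5, 1), ("town", 6, 2), ("sexist", 2, 7)]
for term, count, doc_freq in examples:
    print(term, round(tf_idf(count, doc_freq, N_DOCS), 3))
# karlskoga 6.021
# town 5.419
# sexist 0.718
```

A term that occurs in every chapter gets a weight of zero under this scheme, since log10(16/16) = 0, which is how TF-IDF discounts terms that are not document-specific.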
The following plot shows the twenty largest TF-IDF values and their respective terms. Each value is the maximum TF-IDF of a term over all chapters of the book. The output shows that tax has the largest TF-IDF (i.e. the largest weighted frequency) in at least one chapter of the book and that trial, drug and dummy also appear quite often in the corpus but in few documents.
After looking at the overall largest TF-IDFs in the corpus, we look at the ten largest TF-IDFs by chapter of the book. The following plot shows that chapter 10, The Drugs Don’t Work, is now associated with trial and drug. It also shows that tax is very frequent in chapter 13, From Purse to Wallet, so we can now state that it is specific to this chapter. Finally, chapter 14, Women’s Rights are Human Rights, is in fact more specifically associated with the term interrupt than with the term female, as was suggested in the Term-Frequency by Document section of the Exploratory Data Analysis (EDA).
The keyness measure is based on a chi-square test of independence and indicates whether some terms are characteristic of a target compared to a reference. We illustrate keyness by investigating why the vague term interrupt is associated with chapter 14, Women’s Rights are Human Rights. To do so, we compute keyness with chapter 14 as the target and the other chapters as the reference.
The plot below allows us to see that this chapter is characterized by the terms party, politician, election or even candidate and to conclude at first glance that this chapter is more about political topics than the rest of the corpus.
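For a single term, a chi-square keyness score can be sketched from a 2x2 table of counts: the term versus all other tokens, in the target versus the reference. The Python function below is a minimal illustration using the plain Pearson statistic with a sign indicating direction. It applies no continuity correction, which some implementations do, and the counts in the example are hypothetical.

```python
def keyness_chi2(a, b, c, d):
    """Signed Pearson chi-square for a 2x2 table.

    a: count of the term in the target     b: count of the term in the reference
    c: other tokens in the target          d: other tokens in the reference
    Positive values mean the term is over-represented in the target.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    expected_a = (a + b) * (a + c) / n  # expected count under independence
    return chi2 if a > expected_a else -chi2

# Hypothetical counts: a 100-token target chapter and a 900-token reference.
print(round(keyness_chi2(10, 10, 90, 890), 2))  # 36.28, characteristic of the target
print(round(keyness_chi2(1, 19, 99, 881), 2))   # -0.57, slightly under-represented
```

Terms with large positive scores are those that characterize the target chapter, which is how the keyness plot ranks them.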
Then, we compute the keyness of terms for each chapter of the book in order to better understand what each chapter is about. The following visualization is an animated illustration (in GIF format). Each chapter is in turn the target, with the remaining chapters as the reference.
Chapter 1 Can Snow-Clearing be Sexist? displays terms related to public transportation and city infrastructure; chapter 2 Gender Neutral With Urinals mentions terms connected with public spaces and dangerous behaviors; chapter 3 The Long Friday seems to be about parental benefits in the workplace; chapter 4 The Myth of Meritocracy elaborates on success; chapter 5 The Henry Higgins Effect is a bit more difficult to grasp but uses terms related to chemicals and effects on health such as cancer; chapter 6 Being Worth Less Than a Shoe is also less specific to one type of vocabulary but seems to be about professions and precariousness; chapter 7 The Plough Hypothesis uses terms related to agriculture; chapter 8 One-Size-Fits-Men seems to be about technologies like smartphones and data science; chapter 9 A Sea of Dudes uses vocabulary connected to automobiles; chapter 10 The Drugs Don’t Work is about clinical trials; chapter 11 Yentl Syndrome uses a medical vocabulary; chapter 12 A Costless Resource to Exploit seems to be about economics; chapter 13 From Purse to Wallet is connected to household consumption; chapter 14 Women’s Rights are Human Rights is about politics; chapter 15 Who Will Rebuild? suggests a topic around rebuilding a more peaceful world; and chapter 16 It’s Not the Disaster that Kills You revolves around extreme poverty in the world.
Here, we look at how words co-occur and how inter-connected they are and to do so, we first compute the co-occurrences between terms.
The feature co-occurrence matrix is a 5,741 by 5,741 matrix (32,959,081 elements) that records the number of times two terms co-occur (i.e. co-occurrence frequency) in the corpus. Because the matrix is so large, we reduce its size by keeping only co-occurrences greater than 110. This condition lets us focus on the terms that appear together most often in the corpus, which suggests a connection of interest in the context of the book. After applying this condition, we get the following smaller feature co-occurrence matrix of 20 by 20 features (400 elements).
#> Feature co-occurrence matrix of: 20 by 20 features.
#> features
#> features datum male find time gender study gap sex pay report
#> datum 4481 8548 6860 5544 5961 5258 4026 4303 3080 3346
#> male 8548 4854 7730 5167 6103 6348 3859 5317 2285 2995
#> find 6860 7730 3485 5218 4917 6060 3362 5104 3017 2810
#> time 5544 5167 5218 3000 4469 3644 3291 2252 5227 2250
#> gender 5961 6103 4917 4469 2529 3490 2884 2746 3065 2414
#> study 5258 6348 6060 3644 3490 2622 2456 5277 1636 2181
#> gap 4026 3859 3362 3291 2884 2456 1008 1864 2461 1410
#> sex 4303 5317 5104 2252 2746 5277 1864 3943 562 1794
#> pay 3080 2285 3017 5227 3065 1636 2461 562 2878 1144
#> report 3346 2995 2810 2250 2414 2181 1410 1794 1144 876
#> [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
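As an illustration of how document-level co-occurrence counts and a threshold filter work, here is a minimal Python sketch. It uses one common convention, an assumption here: for each pair of distinct terms, the product of their counts is summed over documents. The report does not state which counting rule its R tooling applies, and the toy documents and helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """For each pair of distinct terms, sum over documents the product of
    the two terms' counts in that document (document-level context)."""
    cooc = Counter()
    for d in docs:
        counts = Counter(d)
        for a, b in combinations(sorted(counts), 2):
            cooc[(a, b)] += counts[a] * counts[b]
    return cooc

def filter_pairs(cooc, threshold):
    """Keep only the pairs whose co-occurrence is strictly above the threshold."""
    return {pair: n for pair, n in cooc.items() if n > threshold}

# Toy documents, for illustration only.
docs = [["datum", "datum", "male", "gap"],
        ["datum", "male", "male"],
        ["gap", "pay"]]
cooc = cooccurrence(docs)
print(dict(cooc))
print(filter_pairs(cooc, 1))  # {('datum', 'gap'): 2, ('datum', 'male'): 4}
```

Under this convention, long chapters in which two frequent terms both appear many times contribute very large counts, which is one way values in the thousands can arise for a 16-document corpus.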
Using the above feature co-occurrence matrix, we generate a network (object) that visualizes the connections of interest between the co-occurring terms appearing more than 110 times in the corpus. To keep the network readable, we add a second condition on the co-occurring features and keep only the co-occurrences greater than 2,100.
The network below reveals that the terms datum, male and find are central and co-occur a lot with the surrounding terms. The output shows a surprising finding: for a book named Invisible Women, female is not a central term and is not among those that co-occur the most with other terms. To dig deeper into this finding, we re-introduce the term woman and find an even more surprising result: woman is still not at the center of the co-occurrences despite its very large frequency (1,594 out of 36,160 token occurrences, or 4.4%).
After investigating term co-occurrences, we look at how terms move together in the book. Dispersion or X-Ray plots inspect where a specific token is used in each text by locating a pattern in each text.
The lexical dispersion plot below shows how the terms female and male move along the chapters. First, male is found in all chapters but only once in chapter 6, Being Worth Less Than A Shoe, whereas female is absent only from chapter 15, Who Will Rebuild?. Given what the title of this chapter implies, this finding seems a bit curious. Second, these two terms often seem to appear together at some point in the chapters. Third, they are both present at a higher frequency in chapters 4, The Myth of Meritocracy, and 14, Women’s Rights are Human Rights, though not necessarily at the same locations, suggesting that the author might compare the two more in these chapters.
After exploring the movements of female and male, we look at the movements of female and sex. Note that sex is reduced to its lemma here, so it could refer to gender, to the nature of a relation or to any other term related to sexuality. The plot below shows that the term sex is more specific to chapter 10, The Drugs Don’t Work, and that female and sex are not necessarily associated.
Lastly, we explore the movements of male and sex. The following plot reveals that male and sex seem to be used together most often in chapter 10, The Drugs Don’t Work. Furthermore, given the earlier result that chapter 10 is associated with trial and drug, we can conclude that this chapter probably focuses on clinical trials and gender.
Lexical diversity is a diversity index that measures the richness of the vocabulary in one document.
The type-token ratio (TTR) is a diversity measure that relates the number of token types in a document to its number of tokens. The more token types a document contains relative to its length, the richer its vocabulary: the closer the TTR is to 1, the richer the vocabulary of the document. We need to be careful with the TTR because it depends on the length of the document. TTR is computed using the document-term matrix.
The graph below sorts the chapters of the book by descending TTR. According to TTR, chapter 15 Who Will Rebuild? has the richest vocabulary among all chapters of the book with a TTR of 0.592, and chapter 3 The Long Friday has the poorest with a TTR of 0.374. Overall, the vocabulary is not very diverse, which could be explained by the fact that the author focuses on one specific issue, gender, and therefore repeatedly uses gender-specific terms.
The Moving-Average Type-Token Ratio (MATTR) is an average of local TTR values. The algorithm slides a window of fixed size over the text, computes the TTR within each window and averages the results. The advantage of the MATTR is that it is less dependent on the length of the document than the TTR. Note that a window that is too large can produce an error, since no local TTR can be computed, and a window that is too small results in pointless values (always 1). MATTR is computed using ordered tokens.
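Both measures are short to write down. The Python sketch below is a minimal illustration on a toy token sequence; the function names `ttr` and `mattr` and the window size of 4 are choices made for this example, and the report computes these measures in R.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window):
    """Moving-average TTR: mean of the TTR over every window of fixed size."""
    if window > len(tokens):
        raise ValueError("window is larger than the document")
    ratios = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Toy ordered tokens, for illustration only.
tokens = ["gender", "datum", "gap", "gender", "datum", "gap", "woman", "gender"]
print(ttr(tokens))                 # 0.5 (4 types over 8 tokens)
print(round(mattr(tokens, 4), 2))  # 0.85
print(mattr(tokens, 1))            # 1.0, a window of one token is always "diverse"
```

The last line shows why a very small window is uninformative, and the `ValueError` corresponds to the case where the window exceeds the document length.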
The graph below shows that the MATTR values are very similar across chapters, ranging from 0.725 to 0.798 over all chapters of the book. According to the MATTR, chapter 15 Who Will Rebuild? still has the richest vocabulary with a MATTR of 0.798.
This section dives deeper into the analysis of the content of the corpus. Each part is supported by charts and graphs produced with the ggplot2, sentimentr, reshape2, quanteda.textmodels, seededlda and text2vec packages.
The previous exploratory data analysis sheds light on overall and document-specific findings related to the terms and vocabulary used throughout the book. Following a top-down approach, we examined terms specific to the general corpus, which unsurprisingly reveal a gender-centric vocabulary, and then looked more closely at how each chapter approaches the gender discrimination issue to which Invisible Women devotes all its attention.
Looking at the complexity of the book, Invisible Women consists of 16 chapters divided into six parts and tells its story over 280 pages. [TO DO: discuss the complexity of the book. Specify the length of the text before and after the cleaning, assess the difference (big or small) and interpret it (if big, most words bring no information; if small, most terms are insightful). DO YOU HAVE ANY OTHER IDEAS THAT COULD ASSESS THE COMPLEXITY OF THE TEXT? Also discuss the uniqueness of the data: NOT SURE WHAT TO MENTION HERE BUT IT MUST BE MENTIONED.]
The subsequent analysis is organized as follows. We first study the sentiment of the corpus by extracting the average sentiment of each chapter using a qualitative and a quantitative approach. Then, we focus on the similarity between terms, to study their context, and between chapters, to better understand their associated topics of interest. From the clustering of term similarities, we continue with topic modelling using two approaches. Finally, we end with unsupervised and supervised learning methods to represent terms and documents in dimensions.
Now that the corpus is cleaned and explored, we proceed to its sentiment analysis (i.e. opinion mining), which qualifies or quantifies the sentiment emerging from a text (i.e. chapter). We use two approaches: the first uses qualifiers (i.e. dictionary-based) and the second uses numerical values (i.e. value-based).
When analyzing the sentiment emerging from a document, we take care not to remove stop words, since they might be in the sentiment dictionary and provide useful insight.
The dictionary-based sentiment analysis matches tokens from each document to a reference dictionary with token values and looks for word polarity (i.e. association with a sentiment). The dictionary matches terms to a positive, negative, neg_positive or neg_negative sentiment. For simplicity, we only consider the positive and negative sentiments in the rest of this analysis. Note that the sentiment is the average over the token values of the document.
The disadvantage of the dictionary-based sentiment analysis is that negated forms are not taken into consideration. For example, the sentence I don’t enjoy the show is considered positive because the contraction don’t is ignored and only the word enjoy is matched.
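The negation blind spot described above is easy to reproduce. The Python sketch below counts dictionary matches with a hypothetical four-word polarity dictionary (`POLARITY`), which stands in for the much larger dictionary used in the report.

```python
# Hypothetical mini polarity dictionary, for illustration only.
POLARITY = {"enjoy": "positive", "good": "positive",
            "bias": "negative", "harsh": "negative"}

def dictionary_sentiment(tokens):
    """Count the tokens matched as positive or negative; unmatched tokens are ignored."""
    counts = {"positive": 0, "negative": 0}
    for t in tokens:
        label = POLARITY.get(t)
        if label is not None:
            counts[label] += 1
    return counts

print(dictionary_sentiment("i don't enjoy the show".split()))
# {'positive': 1, 'negative': 0}
```

Because "don't" is not in the dictionary and word order is ignored, the negated sentence is scored exactly like "i enjoy the show".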
The interactive graph below shows, for each chapter of the book, the number of terms matched with a positive and with a negative sentiment. For example, chapter two Gender Neutral With Urinals has 177 terms matched with a positive sentiment and 408 with a negative one.
Overall, positive and negative sentiments are found in all 16 chapters of the book, although more terms are recognized as negative (3,596) than as positive (2,661).
The valence shifters approach uses positive and negative sentiment scores (i.e. value-based) to extract the sentiment of a document. Here, we use two dictionaries, a polarized words dictionary where we find a list of terms communicating a positive or negative attitude and a valence-shifters dictionary which provides terms that alter or intensify the meaning of the polarized words.
The next table shows the first five words of the polarized words dictionary and their respective numerical scores.
| token | value |
|---|---|
| a plus | 1.00 |
| abandon | -0.75 |
| abandoned | -0.50 |
| abandoner | -0.25 |
| abandonment | -0.25 |
The next table shows the first five words of the valence-shifters dictionary and their respective numerical scores.
| token | value |
|---|---|
| absolutely | 2 |
| acute | 2 |
| acutely | 2 |
| ain’t | 1 |
| aint | 1 |
To proceed to the valence-shifters sentiment analysis, we first extract the sentences from the text and compute their sentiment value. Here, we do not assign weights to certain types of sentences (e.g. questions) since we believe that the sentence type does not have a particular influence on our analysis.
The output below displays the sentiment values of the first five sentences of the corpus, all from chapter one. Any value below -0.05 is considered negative, any value above 0.05 is considered positive, and anything in between is considered neutral. Thus, the first, fourth and fifth sentences are negative (respectively -0.408, -0.296 and -0.535) and the second and third ones are positive (respectively 0.245 and 0.096). Note that because the column word_count had NAs, we remove rows that have no available information since no sentiment can be extracted.
| document | sentence_id | word_count | sentiment |
|---|---|---|---|
| Chapter 1 | 1 | 6 | -0.408 |
| Chapter 1 | 2 | 6 | 0.245 |
| Chapter 1 | 3 | 33 | 0.096 |
| Chapter 1 | 4 | 33 | -0.296 |
| Chapter 1 | 5 | 14 | -0.535 |
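The scoring and labelling rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the exact procedure behind the table: the mini-dictionaries, the coding of the shifter values (1 assumed to be a negator, 2 an amplifier), the amplifier weight and the division by the square root of the word count are all assumptions made for the example.

```python
import math

# Hypothetical mini-dictionaries; the values mirror the format of the tables above.
POLARITY = {"abandon": -0.75, "abandoned": -0.50, "abandonment": -0.25}
# Assumed coding of the shifter values: 1 = negator, 2 = amplifier.
SHIFTERS = {"absolutely": 2, "acutely": 2, "aint": 1}
AMPLIFIER_WEIGHT = 0.8  # assumed boost applied by an amplifier


def sentence_sentiment(sentence):
    """Sum the (shifted) polarities and scale by the square root of the word count."""
    words = sentence.lower().split()
    if not words:
        return None  # no words, hence no sentiment to extract
    total = 0.0
    for i, word in enumerate(words):
        if word not in POLARITY:
            continue
        value = POLARITY[word]
        shifter = SHIFTERS.get(words[i - 1]) if i > 0 else None
        if shifter == 1:      # a negator flips the sign
            value = -value
        elif shifter == 2:    # an amplifier strengthens the polarity
            value *= 1 + AMPLIFIER_WEIGHT
        total += value
    return total / math.sqrt(len(words))


def label(score, threshold=0.05):
    """Apply the -0.05 / 0.05 cut-offs to a sentiment score."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(label(sentence_sentiment("they abandoned the plan")))
```

Only the word immediately before a polarized word is inspected here, which keeps the sketch short; a real valence-shifter approach typically looks at a wider window around each polarized word.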
Since sentiment changes as sentences change, we zoom out to look at the sentiment score by chapter of the book. The ave_sentiment column gives the average sentiment score by chapter.
The next interactive graph displays the average sentiment scores by chapter in decreasing order. According to it, chapter 4 The Myth of Meritocracy has the greatest positive average (0.118) and chapter 16 It’s Not The Disaster That Kills You has the most negative average sentiment. In total, five chapters have a positive sentiment (i.e. above 0.05), four have a negative sentiment (i.e. below -0.05) and seven chapters have a neutral sentiment (i.e. between -0.05 and 0.05).
Similarity is a numerical value used to measure proximity either between terms, to see if they are used in the same context, or between documents, to see if they use the same terms. Note that similarity depends on the types of tokens found in the corpus: if the corpus uses mostly one vocabulary, most terms may turn out to be similar.
To compute similarities, we use three different measures, namely the Jaccard index, the cosine similarity and the Euclidean similarity, all three based on the term frequency-inverse document frequency (TF-IDF) weights.
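Since all three measures rely on TF-IDF weights, a minimal Python sketch of that weighting may help. It assumes the common scheme of raw term count multiplied by log10(N / document frequency); the toy documents are hypothetical and the report's own weighting options may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict mapping a document name to its list of tokens.

    Returns a dict mapping each document name to {term: TF-IDF weight}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: count * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy corpus, not taken from the book.
docs = {
    "text1": ["transport", "bus", "women", "bus"],
    "text2": ["drug", "trial", "women"],
    "text3": ["tax", "women", "pay"],
}
weights = tf_idf(docs)
print(weights["text1"])
```

A term that appears in every document, such as "women" in this toy corpus, receives a weight of zero, which is why TF-IDF emphasizes chapter-specific vocabulary.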
We first compute the similarity between chapters of the book to investigate whether they use the same token types.
We start with the Jaccard index matrix, which displays the relative number of common words using the TF-IDF matrix. Note that the Jaccard coefficient considers each token type only once (i.e. set model).
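As an illustration of the set model, the Jaccard coefficient between two documents can be sketched as the size of the intersection of their term sets divided by the size of their union. The sketch assumes that a term counts as present when its weight is positive; the chapter vectors are hypothetical.

```python
def jaccard(weights_a, weights_b):
    """Jaccard coefficient between two {term: weight} dictionaries (set model)."""
    terms_a = {term for term, w in weights_a.items() if w > 0}
    terms_b = {term for term, w in weights_b.items() if w > 0}
    union = terms_a | terms_b
    if not union:
        return 0.0  # two empty documents: no shared vocabulary to compare
    return len(terms_a & terms_b) / len(union)


# Hypothetical weighted vectors for three chapters.
chapter_a = {"trial": 1.2, "drug": 0.9, "pain": 0.4, "women": 0.0}
chapter_b = {"trial": 0.7, "drug": 1.1, "heart": 0.5}
chapter_c = {"rebuild": 1.4, "peace": 0.8}

print(jaccard(chapter_a, chapter_b))
print(jaccard(chapter_a, chapter_c))
```

Because only presence matters, the size of a weight has no effect on the result.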
From the Jaccard similarity matrix, where similarities are based on the Jaccard coefficient (i.e. relative number of common words) by document, we get the heatmap below, which shows which chapters are likely to use similar terms. A red square indicates a strong similarity (e.g. 1) whereas a dark blue square indicates no similarity. Therefore, the heatmap shows that chapter 15 Who Will Rebuild is the text with the least similarity with other chapters (score closer to 0). Also, we observe that chapters 10 The Drugs Don’t Work and 11 Yentl Syndrome seem to use slightly more similar terms.
Then, we compute the cosine similarity matrix, which measures the similarity between two vectors of an inner product space. Note here that the similarity is independent of the vector length and that only the cosine of the angle between two weighted term-frequency vectors matters.
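The independence from vector length can be checked with a short Python sketch. The vectors below are hypothetical: the second one is the first scaled by three, as if a longer chapter used the same vocabulary in the same proportions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {term: weight} dictionaries."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


short_chapter = {"trial": 1.0, "drug": 2.0}
long_chapter = {"trial": 3.0, "drug": 6.0}   # same direction, three times longer
other_chapter = {"tax": 1.0}                 # no shared term

print(cosine(short_chapter, long_chapter))
print(cosine(short_chapter, other_chapter))
```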
Compared to the heatmap generated using the Jaccard index, the next heatmap displays many more pairs of chapters with no similarity (darker blue squares). However, chapters 10 and 11 are still shown as sharing less similarity.
Finally, we compute the Euclidean-based similarity matrix using the Euclidean distance. The heatmap below again shows a different output. Indeed, fewer documents are shown as having no similarity, and more chapter pairs stand in a middle range from which we cannot infer similarity. For example, chapters 15 Who Will Rebuild and 16 It’s Not The Disaster That Kills You show neither clear similarity nor clear dissimilarity. In contrast, chapters 12 A Costless Resource To Exploit and 15 Who Will Rebuild use slightly more similar terms.
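One way to turn Euclidean distances into similarities is sketched below in Python. The conversion used here, one minus the distance divided by the largest observed distance, is an assumption made for illustration; the text does not state which conversion produced the heatmap. The document vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two {term: weight} dictionaries."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def euclidean_similarities(docs):
    """Rescale pairwise distances to [0, 1], where 1 means identical vectors."""
    names = sorted(docs)
    dist = {(i, j): euclidean_distance(docs[i], docs[j]) for i in names for j in names}
    largest = max(dist.values())
    if largest == 0:
        return {pair: 1.0 for pair in dist}  # all documents are identical
    return {pair: 1 - d / largest for pair, d in dist.items()}


# Hypothetical weighted vectors.
docs = {
    "text1": {"transport": 3.0},
    "text2": {"transport": 3.0, "bus": 4.0},
    "text3": {"tax": 4.0},
}
similarities = euclidean_similarities(docs)
print(similarities[("text1", "text2")])
```

Unlike the cosine, this measure is sensitive to vector length, which may explain why the three heatmaps differ.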
To cluster documents, we need to build the dissimilarities and/or the Vector Space Model (VSM) on which we can apply the clustering methods: hierarchical clustering based on distances and K-means partitioning based on features.
Clusters are difficult to interpret. This is why we will look at the largest term frequencies of each cluster to better understand the common denominator behind the grouping.
Hierarchical clustering is based on distances and applied on the dissimilarities using the function hclust(). The hierarchical approach first assigns each document to its own cluster; then, at each iteration, the two most similar clusters are grouped together into one cluster. The iteration continues until all chapters belong to a single cluster.
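The agglomerative procedure described above can be sketched in Python as follows. This is not a reimplementation of hclust(): it assumes complete linkage (the largest pairwise dissimilarity between two clusters) and stops at a requested number of clusters, and the similarity values are hypothetical.

```python
def agglomerate(dissim, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dissim maps frozenset({a, b}) to the dissimilarity between items a and b.
    """
    items = sorted({x for pair in dissim for x in pair})
    clusters = [{x} for x in items]  # every document starts in its own cluster

    def linkage(c1, c2):
        # Complete linkage (assumed): distance between the two farthest members.
        return max(dissim[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical similarities between four chapters, inverted into dissimilarities.
similarity = {
    ("ch10", "ch11"): 0.60, ("ch10", "ch15"): 0.10, ("ch10", "ch16"): 0.05,
    ("ch11", "ch15"): 0.10, ("ch11", "ch16"): 0.10, ("ch15", "ch16"): 0.40,
}
dissim = {frozenset(pair): 1 - s for pair, s in similarity.items()}

print(agglomerate(dissim, 2))
```

Recording the dissimilarity at which each merge happens would give the heights of a dendrogram; the sketch omits this to stay short.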
The inverted Jaccard dissimilarity matrix shows that there are two main clusters. One cluster groups chapter 15 Who Will Rebuild and chapter 16 It’s Not The Disaster That Kills You and the other one groups the rest of the book. Moreover, inside the second, larger cluster, we see that there are two sub-clusters.
The following table indicates to which cluster a chapter belongs.
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 , text2 , text3 , text7 , text12, text14 | text4 , text5 , text6 , text9 , text10, text11 | text8 | text13 | text15 | text16 |
For interpretation purposes, we extract the ten words that are the most frequent in each cluster to identify a common denominator between chapters. According to the most used terms in each cluster, chapters 1, 2, 3, 7, 12 and 14 share terms regarding public transport or spaces (cluster 1), chapters 4, 5, 6, 9, 10 and 11 are about medical or technological trials (cluster 2) and chapters 8, 13, 15 and 16 each form their own cluster (clusters 3 to 6).
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| transport | trial | keyboard | tax | rebuild | refugee |
| bus | drug | corpus | poverty | peace | violence |
| stove | dummy | voice | earner | orleans | shelter |
| interrupt | vr | pianist | marry | disaster | disaster |
| toilet | tech | algorithm | household | agreement | homeless |
| pedestrian | crash | recognition | file | miami | homelessness |
| travel | chemical | dataset | zombie | displace | conflict |
| party | pain | handspan | couple | fordham | ebola |
| plough | meritocracy | phone | income | gujarat | cyclone |
| agriculture | clinical | inch | youth | hurricane | camp |
The inverted cosine dissimilarity matrix seems to show that there are again two main clusters, each with two smaller sub-clusters. One cluster groups chapters 10, 11, 5, 6, 8 and 9 and the other one groups chapters 7, 1, 12, 15, 16, 13, 3, 2, 4 and 14.
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 , text2 , text15, text16 | text3 , text12, text13 | text4 , text14 | text5, text6, text8, text9 | text7 | text10, text11 |
For interpretation purposes, we extract the ten most frequent terms of each cluster to identify a common denominator between chapters. According to the most used terms in each cluster, chapters 1, 2, 15 and 16 share terms regarding public transport or spaces and violence (cluster 1), chapters 3, 12 and 13 use terms regarding households, families and economics (cluster 2), chapters 4 and 14 seem to share a political vocabulary (cluster 3), chapters 5, 6, 8 and 9 use technical terms (cluster 4), chapter 7 has its own cluster about agriculture (cluster 5) and chapters 10 and 11 use a medical vocabulary (cluster 6).
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| transport | tax | interrupt | dummy | stove | trial |
| bus | pay | candidate | vr | plough | drug |
| toilet | gdp | party | crash | agriculture | clinical |
| pedestrian | poverty | meritocracy | chemical | stave | pain |
| travel | childcare | politician | ppe | farmer | cell |
| violence | household | election | boler | agricultural | blood |
| disaster | marry | hire | worker | crop | fda |
| snow | offer | mp | stoffregen | farm | heart |
| sánchez | week | aw | tech | strength | disease |
| madariaga | carer | teach | keyboard | doss | medication |
The Euclidean dissimilarity matrix seems to show a sequence of clusters, which is a completely different clustering pattern compared to the other two distance measures. This pattern seems to indicate that each chapter is rather independent from the others and that chapters do not share many similarities across the book.
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 | text2 , text3 , text4 , text5 , text6 , text7 , text8 , text11, text12, text15, text16 | text9 | text10 | text13 | text14 |
For interpretation purposes, we extract the ten most frequent terms of each cluster to identify a common denominator between chapters. Here, clusters 1 and 3
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| pedestrian | toilet | dummy | trial | tax | interrupt |
| transport | stove | vr | drug | poverty | party |
| snow | violence | crash | cell | earner | candidate |
| sánchez | worker | boler | clinical | marry | politician |
| madariaga | pay | stoffregen | fda | household | election |
| travel | bus | tech | nih | file | mp |
| de | chemical | motion | blood | zombie | aw |
| trip | girl | seat | medication | couple | representation |
| favela | plough | belt | adr | income | political |
| clear | meritocracy | tin | animal | youth | ambition |
K-means is applied on the features (i.e. terms) using the TF-IDF matrix.
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text14 | text3 , text4 , text5 , text6 , text7 , text8 , text11, text12, text13, text15, text16 | text1 | text2 | text9 | text10 |
With K-means, cluster 2 differs from the clusters obtained with the other methods, with top terms related to politics.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| interrupt | tax | pedestrian | toilet | dummy | trial |
| party | stove | transport | bus | vr | drug |
| candidate | pay | snow | transport | crash | cell |
| politician | worker | sánchez | girl | boler | clinical |
| election | violence | madariaga | transit | stoffregen | fda |
| mp | chemical | travel | urinal | tech | nih |
| aw | plough | de | harassment | motion | blood |
| representation | meritocracy | trip | loukaitou | seat | medication |
| political | disaster | favela | sideris | belt | adr |
| ambition | agriculture | clear | sexual | tin | animal |
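For reference, the K-means partitioning applied above alternates between assigning each document to its nearest centroid and recomputing each centroid as the mean of its members. The Python sketch below illustrates this on hypothetical two-dimensional vectors; starting from the first k points is a simplification chosen to keep the example deterministic, whereas real implementations usually use random starts.

```python
def kmeans(points, k, n_iter=20):
    """Plain K-means on a list of equal-length numeric vectors."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start (assumption)
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical weights of four documents on two features.
points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Because the result depends on the starting centroids, running K-means with several starts and keeping the best partition is the usual safeguard.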
Now we analyze similarities between words across chapters (i.e. documents). Because of the large number of words, we focus on words with a frequency rank smaller than or equal to 40 (i.e. the 40 most frequent words).
Which words are similar? Similar words are those used in similar proportions across documents.
Using co-occurrences to cluster features requires transforming co-occurrences into dissimilarities. The object needs to be the tokens object, because co-occurrences need the token order and thus cannot be computed on a BoW object.
Are terms co-occurring? If yes, do they co-occur a lot?
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. There are two approaches: Latent Semantic Analysis model and Latent Dirichlet Allocation model.
Both models can be applied on DTM or TF-IDF matrices.
The Latent Semantic Analysis (LSA) is a dimension reduction method that decomposes the DTM or TF-IDF matrices in three sub-matrices around a pre-determined number of topics. Here, we proceed to LSA on the document-term (i.e. document-feature) matrix and then extract the matrices of the LSA decomposition.
The first dimension of LSA is associated with the document lengths (i.e. the row sums of the DTM), which is why it is often ignored.
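The link between the first dimension and document length can be illustrated with a small Python sketch. LSA relies on a singular value decomposition; the sketch below only recovers the first dimension, using power iteration, on a hypothetical document-term matrix whose rows have increasing totals. It is an illustration of the idea, not the decomposition routine used for the report.

```python
import math


def leading_singular_triplet(matrix, n_iter=200):
    """Approximate the first singular value and vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        # u is proportional to A v (document side).
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v is proportional to A^T u (term side); its norm is the singular value.
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


# Hypothetical DTM: three documents (rows) of increasing length, three terms (columns).
dtm = [
    [2, 1, 0],
    [4, 2, 1],
    [8, 5, 2],
]
sigma, doc_scores, term_scores = leading_singular_triplet(dtm)
row_sums = [sum(row) for row in dtm]
print(row_sums)
print([round(score, 3) for score in doc_scores])
```

In this toy matrix the document scores on the first dimension increase with the row sums. The sign of a singular vector is arbitrary, so a decomposition routine may return the same scores with the opposite sign.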
The output below shows the association between the chapters and the ten topics. For example, text1 is most associated with dimension (i.e. topic) 4 (0.727) and least associated with dimension (i.e. topic) 5 (-0.271).
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> text1 0.2018 0.11752 -0.19184 0.72710 -0.27103 -0.2251 0.476481
#> text2 0.1956 0.08192 -0.16037 0.42722 -0.09544 0.1764 -0.622684
#> text3 0.1485 0.04123 -0.12506 0.09144 0.26435 0.0798 -0.011243
#> text4 0.1923 0.07261 -0.11759 -0.02451 0.21483 0.8083 0.417789
#> text5 0.1302 0.03136 0.00308 0.05090 0.07383 0.1504 -0.208109
#> text6 0.1060 0.02393 -0.05058 0.07369 0.07692 0.1108 -0.189653
#> text7 0.0794 0.01738 -0.03277 0.05171 0.08908 0.0777 -0.079578
#> text8 0.0774 0.02957 -0.00125 0.00463 0.03408 0.0857 0.010204
#> text9 0.5744 0.58600 0.51451 -0.18195 -0.05804 -0.1336 0.018497
#> text10 0.5553 -0.73682 0.14966 -0.07018 -0.05705 -0.0955 0.073838
#> text11 0.2399 -0.23374 0.03162 0.02098 0.02088 0.0466 -0.096923
#> text12 0.0672 0.02220 -0.05154 0.07288 0.14703 -0.0243 -0.000291
#> text13 0.0951 0.04021 -0.13836 0.09835 0.84168 -0.3760 0.039819
#> text14 0.3172 0.16283 -0.76817 -0.45433 -0.20878 -0.1724 0.006966
#> text15 0.0293 0.00633 -0.02394 0.03464 -0.00029 0.0108 -0.040004
#> text16 0.1026 0.01312 -0.06164 0.08976 0.01562 0.0964 -0.326218
#> [,8] [,9] [,10]
#> text1 0.08506 0.05690 0.039598
#> text2 -0.38771 -0.22719 0.076879
#> text3 0.24046 0.34038 -0.560136
#> text4 -0.20122 -0.10183 0.082302
#> text5 0.41032 0.47366 0.619532
#> text6 0.19092 0.28733 -0.000854
#> text7 0.69499 -0.69244 -0.007632
#> text8 0.06522 0.01795 -0.018213
#> text9 -0.05664 -0.03486 -0.025618
#> text10 -0.06282 -0.06250 0.147925
#> text11 0.04556 0.11108 -0.439631
#> text12 0.07124 0.05643 -0.103670
#> text13 -0.19409 -0.10575 0.173804
#> text14 -0.00778 -0.01062 0.039224
#> text15 0.00394 -0.00926 -0.027555
#> text16 0.00268 0.01373 -0.169465
To verify whether the first dimension is associated with the document length, we look at the scatter plot below. We see that dimension 1 is negatively correlated with the number of tokens, which confirms that the first dimension is associated with the document length.
[Improve plot: add a straight line showing the correlation]
The output below shows the association between the first ten terms and the ten topics. For example, the term sexist is most associated with topic 6 (0.0148).
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> sexist 0.01200 0.00587 -0.019366 -0.000971 -0.000782 0.014755
#> joke 0.00810 0.00806 0.003833 0.006891 -0.004626 -0.005254
#> town 0.01863 0.01480 -0.007561 0.052847 -0.023678 -0.021740
#> karlskoga 0.01404 0.00898 -0.015193 0.061275 -0.025402 -0.021981
#> sweden 0.01626 0.00588 -0.021851 0.017419 0.005159 -0.000378
#> hit 0.00758 -0.00308 -0.002666 0.006217 0.002641 0.002664
#> initiative 0.00961 0.00607 -0.000971 0.007421 0.006773 0.006905
#> lens 0.00810 0.00806 0.003833 0.006891 -0.004626 -0.005254
#> harsh 0.00411 0.00218 -0.003676 0.008881 -0.000790 0.008543
#> glare 0.00281 0.00160 -0.002891 0.010112 -0.001743 -0.003654
#> [,7] [,8] [,9] [,10]
#> sexist 0.01335 0.010783 -0.014006 0.003486
#> joke 0.00758 0.000459 0.000367 0.000247
#> town 0.04406 0.007329 0.005101 0.003749
#> karlskoga 0.04864 0.009160 0.006312 0.004669
#> sweden -0.01113 0.001419 0.010886 -0.027996
#> hit -0.00826 0.002794 0.004926 -0.012127
#> initiative 0.00372 0.038046 -0.037837 -0.002577
#> lens 0.00758 0.000459 0.000367 0.000247
#> harsh 0.01369 -0.001876 -0.000748 0.002156
#> glare 0.00729 0.002525 0.001886 -0.001133
To actually be able to interpret the topics (i.e. dimensions) of the LSA, we look at the ten terms with the largest values and the ten terms with the lowest values. We take a look at dimensions 4 and 5.
According to the output below, topic 4 is positively associated with transport, pedestrian, bus, travel, snow, sánchez, madariaga, toilet, trip, de and negatively associated with ambition, aw, dummy, mp, politician, representation, election, candidate, party, interrupt. Therefore, documents with a large value on dimension 4 use the first group of terms more and the second group less.
#> transport pedestrian bus travel
#> 0.2744 0.2328 0.2146 0.1770
#> snow sánchez madariaga toilet
#> 0.1716 0.1716 0.1716 0.1474
#> trip de ambition aw
#> 0.1396 0.1381 -0.0689 -0.0919
#> dummy mp politician representation
#> -0.0920 -0.0924 -0.0971 -0.0977
#> election candidate party interrupt
#> -0.0978 -0.1228 -0.1354 -0.1608
According to the output below, topic 5 is positively associated with tax, poverty, marry, earner, household, pay, income, file, zombie, couple and negatively associated with trip, de, travel, bus, snow, sánchez, madariaga, interrupt, pedestrian, transport. Therefore, documents with a large value on dimension 5 use the first group of terms more and the second group less.
#> tax poverty marry earner household pay
#> 0.6936 0.1484 0.1167 0.1102 0.1100 0.1001
#> income file zombie couple trip de
#> 0.0970 0.0970 0.0947 0.0908 -0.0559 -0.0569
#> travel bus snow sánchez madariaga interrupt
#> -0.0680 -0.0690 -0.0711 -0.0711 -0.0711 -0.0822
#> pedestrian transport
#> -0.0965 -0.1023
The following biplot associates the positions of the terms (points) and the positions of the document (arrows) in the LSA space for topics 4 and 5 computed above.
Here, we see that topic (i.e. dimension) 4 is positively associated with tech, body, data, politicians, female, and anti-associated with transport, leave, sexy.
The Latent Dirichlet Allocation (LDA) is a generative model (i.e. Bayesian model) for topic modeling. LDA generates a Bag of Words model where the number of topics is pre-defined.
The following output displays the ten most frequent terms per topic and the ten most frequent topics per document.
The term-topic analysis computes the conditional probability for a term to be found given that it is assigned to a given topic. Then, for a given topic, the largest conditional probabilities give the terms that are the most associated with this topic. The advantage of LDA is that we obtain the probabilities in addition to the term-topic assignment.
The output below shows, for each of the ten topics, the terms with the highest conditional probabilities. For example, this conditional probability for the term women is almost 1, indicating that women is very strongly associated with topic 5 and so that any document associated with topic 5 will have the term women in it. We also see that some topics are better defined than others. For example, topic 5 is the most well defined and topics 6, 7, 8 and 9 are not well defined.
To sum up, the tables below give the probabilities of selecting a term w given that the term comes from topic k.
The topic-document analysis provides the probabilities for a topic to be found in a document.
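To make the two probability tables concrete, the Python sketch below derives them from a set of token-level topic assignments. It assumes that the assignments have already been produced by a fitted LDA model and uses the usual smoothed count ratios; the assignments and the hyperparameters alpha and beta are hypothetical.

```python
from collections import Counter


def estimate_phi_theta(assignments, n_topics, alpha=0.1, beta=0.01):
    """assignments: list of (document, term, topic) triples.

    phi[k][w] approximates P(term w | topic k),
    theta[d][k] approximates P(topic k | document d).
    """
    vocab = sorted({term for _, term, _ in assignments})
    docs = sorted({doc for doc, _, _ in assignments})
    term_topic = Counter((topic, term) for _, term, topic in assignments)
    doc_topic = Counter((doc, topic) for doc, _, topic in assignments)
    topic_total = Counter(topic for _, _, topic in assignments)
    doc_total = Counter(doc for doc, _, _ in assignments)

    phi = {
        k: {w: (term_topic[(k, w)] + beta) / (topic_total[k] + beta * len(vocab)) for w in vocab}
        for k in range(n_topics)
    }
    theta = {
        d: {k: (doc_topic[(d, k)] + alpha) / (doc_total[d] + alpha * n_topics) for k in range(n_topics)}
        for d in docs
    }
    return phi, theta


# Hypothetical assignments: (document, term, topic index).
assignments = [
    ("text13", "tax", 0), ("text13", "income", 0), ("text13", "women", 1),
    ("text14", "party", 1), ("text14", "election", 1), ("text14", "women", 1),
]
phi, theta = estimate_phi_theta(assignments, n_topics=2)
print(max(phi[0], key=phi[0].get))
print(theta["text14"])
```

Each row of phi sums to one over the vocabulary and each row of theta sums to one over the topics, which is what allows them to be read as probabilities.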
The output below shows that more than 60% of topic 2 is about text 13 and more than 60% of topic 3 is about text 14.
To dive a little deeper, we then look at the longest chapters (i.e. documents).
[Improve layout: can we add chapter name instead of text ?]
Below we see that chapter 14 Women’s Rights Are Human Rights mainly talks about topic 3.
The topic distribution (i.e. proportion) in a corpus is called the topic prevalence.
The following output displays the prevalence scores for each topic using the thetas (i.e. topic-document probabilities). Here, there are 16 documents. We see that the most prevalent topic is topic 10 with 0.3915.
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> 0.0580 0.0834 0.0665 0.0655 0.0407 0.1368 0.0714 0.0385
#> topic9 topic10
#> 0.0477 0.3915
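Assuming that prevalence is the average of the topic-document probabilities over the documents, the computation reduces to a column mean, as in the Python sketch below. The theta values are hypothetical and only serve to show the arithmetic.

```python
def topic_prevalence(theta):
    """Average each topic's probability over all documents."""
    n_docs = len(theta)
    topics = sorted({k for dist in theta.values() for k in dist})
    return {k: sum(dist.get(k, 0.0) for dist in theta.values()) / n_docs for k in topics}


# Hypothetical topic-document probabilities for three documents and two topics.
theta = {
    "text1": {"topic1": 0.7, "topic2": 0.3},
    "text2": {"topic1": 0.2, "topic2": 0.8},
    "text3": {"topic1": 0.3, "topic2": 0.7},
}
prevalence = topic_prevalence(theta)
print(prevalence)
```

Since each document's probabilities sum to one, the prevalence scores also sum to one, as they do in the output above.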
Topic modeling makes it possible to organize, understand and summarize large corpora (plural of corpus). Yet, topic modeling has limitations, especially regarding the interpretation of its outcomes. Therefore, some measures provide a way to extract more insight from it.
The measure of coherence makes it possible to assess the quality of a topic (i.e. good versus bad).
The output below gives the coherence of each of the ten topics. The most coherent topic is topic 10 with a coherence of 0.557 and the least coherent topic is topic 8 with a coherence of -9.174.
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> -3.969 -4.524 -5.741 -4.677 -8.080 -2.851 -4.306 -9.174
#> topic9 topic10
#> -3.742 0.557
To verify we take a look at the co-document frequencies. Comparing the two below term-frequency matrices for topic 9 and 4, it is obvious that the top five terms in topic 5 are co-occurring more often in the same document than the top five terms in topic 9.
#> features
#> features female government mp election party
#> female 15 13 3 3 2
#> government 13 13 3 3 2
#> mp 3 3 2 3 2
#> election 3 3 3 2 2
#> party 2 2 2 2 2
#> features
#> features transport travel public plan trip
#> transport 2 3 4 5 2
#> travel 3 4 5 4 2
#> public 4 5 8 7 3
#> plan 5 4 7 6 3
#> trip 2 2 3 3 1
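The relation between co-document frequencies and coherence can be sketched in Python as follows. The sketch assumes a UMass-style coherence, the sum over pairs of top terms of log((co-document frequency + 1) / document frequency); the text does not state which coherence definition was used, and the toy documents are hypothetical.

```python
import math


def co_document_frequency(docs, terms):
    """Number of documents containing both terms, for every pair of terms.

    The diagonal (a term paired with itself) is its document frequency.
    """
    return {(a, b): sum(1 for d in docs if a in d and b in d) for a in terms for b in terms}


def umass_coherence(docs, top_terms):
    """UMass-style coherence of a topic described by its ordered top terms.

    Assumes every top term occurs in at least one document.
    """
    codf = co_document_frequency(docs, top_terms)
    score = 0.0
    for i in range(1, len(top_terms)):
        for j in range(i):
            later, earlier = top_terms[i], top_terms[j]
            score += math.log((codf[(later, earlier)] + 1) / codf[(earlier, earlier)])
    return score


# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"female", "government", "mp"},
    {"female", "government"},
    {"female", "election"},
    {"transport", "trip"},
]
print(umass_coherence(docs, ["female", "government"]))
print(umass_coherence(docs, ["female", "transport"]))
```

Terms that tend to appear in the same documents yield a higher score than terms that never appear together, which is the behaviour the co-document frequency matrices are meant to reveal.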
A topic is exclusive if it is associated with terms that are not associated to another topic.
The output below shows that the most exclusive topic (i.e. the topic which has the most terms not associated with another topic) is topic 1 with an exclusivity measure of 0.75964, meaning that its five top terms are more specific to it, and the least exclusive one is topic 7 with 0.00174.
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> 0.75964 0.01084 0.01153 0.05086 0.00491 0.03552 0.00174 0.01656
#> topic9 topic10
#> 0.20167 0.02247
Embedding refers to the representation of elements (documents or tokens) in a Vector Space Model (VSM).
The idea behind word embedding based on co-occurrences is to reflect co-occurrences and not only Bag of Words (BoW).
To start with, we compute a feature co-occurrence symmetric matrix with a window of five. The matrix is very large (8,640x8,640) and displays the number of times terms (i.e. features) appear together in a window of five words.
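Counting co-occurrences within a window can be sketched in Python as below. The sketch assumes that a window of five means up to five tokens on either side of a term and that counts are symmetric; the sentence is a hypothetical example.

```python
from collections import Counter


def cooccurrences(tokens, window=5):
    """Symmetric co-occurrence counts for terms at most `window` tokens apart."""
    counts = Counter()
    for i, target in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            context = tokens[j]
            counts[(target, context)] += 1
            if target != context:
                counts[(context, target)] += 1  # mirror the count to keep the matrix symmetric
    return counts


# Hypothetical tokenized sentence.
tokens = ["women", "use", "public", "transport", "more", "than", "men"]
counts = cooccurrences(tokens, window=5)
print(counts[("women", "transport")], counts[("women", "men")])
```

Here "women" and "men" are six tokens apart and therefore do not co-occur, whereas "women" and "transport" do. The order of the tokens matters, which is why this cannot be computed from a Bag of Words.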
From the feature co-occurrence matrix, we compute two vector representations for a given word (i.e. feature): one for the word as the central term and one for the word as part of the context. To obtain a unique representation for a given word, we then compute the average of the two representations.
The following output displays the unique vector representation of each word.
We now plot the vectors of the 20 most used words.
To build the document embedding we compute the centroids of the documents. First, we need to extract the words in each document.
Then, for these words we extract the word vectors and make a matrix.
Finally, we average all these vectors.
Now we make the loop to apply the previous steps on all documents.
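The centroid steps above (extract the words of a document, look up their vectors, average them, then loop over all documents) can be sketched in Python as follows. The word vectors and documents are hypothetical, and words without a vector are simply skipped, which is an assumption of the sketch.

```python
def document_embedding(tokens, word_vectors):
    """Centroid of the vectors of the words found in one document."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None  # no known word, hence no embedding
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical two-dimensional word vectors.
word_vectors = {
    "transport": [1.0, 0.0],
    "bus": [0.8, 0.2],
    "tax": [0.0, 1.0],
}
documents = {
    "text1": ["transport", "bus", "bus"],
    "text13": ["tax", "unknown"],
}

# Loop over all documents to build one centroid per document.
embeddings = {name: document_embedding(tokens, word_vectors) for name, tokens in documents.items()}
print(embeddings)
```

Repeated words are averaged as many times as they occur, so frequent words pull the centroid towards their own vector.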
The goal of the supervised analysis is to re-classify the documents (i.e. chapters) of the corpus using the features (i.e. terms) analyzed previously [add where maybe] [is that a good reformulation of the aim?]. To do so, we follow a machine learning approach that consists of splitting the corpus (data object) into a training set and a test set. Then, we train the classifiers on the training set and finally select the best classifiers on the test set.
To be able to apply the classification method described above, the corpus must be cleaned so that the features (i.e. terms) are usable. The cleaning process is carried out in the Data Structuring and Cleaning of the Data section and consists of sequential steps: tokenization, removal of useless words (i.e. stop words), lemmatization (i.e. token simplification) and stemming (i.e. reducing words to their stems).
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2 3 4 5 6
#> 1 18 0 0 0 0 1
#> 2 1 32 6 2 13 3
#> 3 2 6 17 2 4 1
#> 4 1 0 0 19 0 0
#> 5 0 0 0 0 13 0
#> 6 0 0 0 0 0 6
#>
#> Overall Statistics
#>
#> Accuracy : 0.714
#> 95% CI : (0.634, 0.786)
#> No Information Rate : 0.259
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.645
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 0.818 0.842 0.739 0.826 0.4333
#> Specificity 0.992 0.771 0.879 0.992 1.0000
#> Pos Pred Value 0.947 0.561 0.531 0.950 1.0000
#> Neg Pred Value 0.969 0.933 0.948 0.969 0.8731
#> Prevalence 0.150 0.259 0.156 0.156 0.2041
#> Detection Rate 0.122 0.218 0.116 0.129 0.0884
#> Detection Prevalence 0.129 0.388 0.218 0.136 0.0884
#> Balanced Accuracy 0.905 0.806 0.809 0.909 0.7167
#> Class: 6
#> Sensitivity 0.5455
#> Specificity 1.0000
#> Pos Pred Value 1.0000
#> Neg Pred Value 0.9645
#> Prevalence 0.0748
#> Detection Rate 0.0408
#> Detection Prevalence 0.0408
#> Balanced Accuracy 0.7727
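The headline statistics of this output can be recomputed directly from the confusion matrix, which makes their meaning easier to follow. The Python sketch below copies the counts from the table above (rows are predictions, columns are reference classes) and uses the standard definitions of accuracy, Cohen's kappa and per-class sensitivity.

```python
# Confusion matrix copied from the output above: rows = predicted, columns = reference.
confusion = [
    [18, 0, 0, 0, 0, 1],
    [1, 32, 6, 2, 13, 3],
    [2, 6, 17, 2, 4, 1],
    [1, 0, 0, 19, 0, 0],
    [0, 0, 0, 0, 13, 0],
    [0, 0, 0, 0, 0, 6],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

# Accuracy: share of test items on the diagonal.
accuracy = correct / total

# Cohen's kappa: accuracy corrected for the agreement expected by chance.
row_totals = [sum(row) for row in confusion]
col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

# Sensitivity: share of each reference class that is correctly predicted.
sensitivity = [confusion[j][j] / col_totals[j] for j in range(n_classes)]

print(round(accuracy, 3), round(kappa, 3))      # 0.714 0.645
print([round(s, 3) for s in sensitivity])
```

The recomputed accuracy and kappa match the reported values. The sensitivities show that classes 5 and 6 are the hardest to recover, with most misclassified items of class 5 being predicted as class 2.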
Throughout 16 chapters, women’s invisibility is raised from 16 different approaches revealing … * Take home message * Limitations * Future work?